Add Buildkite pipeline for AI E2E tests (simulator-llm-pilot gem) #25444
| App Name | WordPress |
|---|---|
| Configuration | Release-Alpha |
| Build Number | 32185 |
| Version | PR #25444 |
| Bundle ID | org.wordpress.alpha |
| Commit | 146daa1 |
| Installation URL | 7l99o9dpqifu8 |

| App Name | Jetpack |
|---|---|
| Configuration | Release-Alpha |
| Build Number | 32185 |
| Version | PR #25444 |
| Bundle ID | com.jetpack.alpha |
| Commit | 146daa1 |
| Installation URL | 04hfgpqarjii0 |
Force-pushed from 1602fa9 to 8589139.
Hi @iangmaia, shall we land this and start running nightly jobs?
The gem provides a sandboxed agent that drives the simulator through a fixed set of tools (tap, swipe, type, REST API) with no arbitrary code execution. It handles WDA lifecycle, session management, context window compression, and verification/cleanup enforcement internally.

The Buildkite step:
- Checks for "Testing" label (skips if missing)
- Downloads build artifacts and installs app on simulator
- Installs the simulator-llm-pilot gem from GitHub
- Runs all test cases in Tests/AgentTests/ui-tests/

Ref: AINFRA-2176

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gem build resolves spec file paths relative to cwd, so bin/simulator-llm-pilot wasn't found when building from the wordpress-ios repo root. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
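The cwd dependency described in this commit can be sketched in Ruby. This is an illustration, not the PR's script: `build_gem_in` and its `runner` parameter are hypothetical names, and the sketch assumes the gemspec lists its files with cwd-relative paths, which is why the build has to run from the gem's own directory.

```ruby
# Hypothetical helper: run `gem build` from inside the gem's source directory
# so that cwd-relative paths in the gemspec (for example bin/simulator-llm-pilot)
# resolve against the gem, not against the calling repository's root.
# `runner` is injectable for testing; by default it shells out via Kernel#system.
def build_gem_in(source_dir, gemspec_name, runner: ->(*cmd) { system(*cmd) })
  Dir.chdir(source_dir) do
    # Inside this block Dir.pwd is source_dir; the previous cwd is restored afterwards.
    runner.call('gem', 'build', gemspec_name)
  end
end
```

A call such as `build_gem_in('/path/to/simulator-llm-pilot', 'simulator-llm-pilot.gemspec')` would build from the checkout regardless of where the CI step was started; the path here is a placeholder.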
Extract WDA build to a separate build-wda.sh script for clarity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The gem no longer hardcodes WordPress login flow in its system prompt. Add app-instructions.md with the WordPress/Jetpack login flow and pass it via --app-instructions-file. Also pass --app-name so the LLM knows the app's display name. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
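As a rough sketch of how those two options might be assembled before invoking the gem: `app_options` is a hypothetical helper, and only the `--app-instructions-file` and `--app-name` flags come from the commit message. Everything else the command line needs is not shown in this PR excerpt and is left out.

```ruby
# Build the two options named in the commit message above. Any other arguments
# the CLI needs (test directory, simulator, credentials) are not shown in the
# source, so they are deliberately left out of this sketch.
def app_options(instructions_file:, app_name:)
  unless File.file?(instructions_file)
    raise ArgumentError, "app instructions file not found: #{instructions_file}"
  end

  ['--app-instructions-file', instructions_file, '--app-name', app_name]
end
```

Failing early on a missing instructions file keeps the agent from starting a run without the login flow it now depends on.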
Force-pushed from efadbc8 to 146daa1.
@crazytonyli Hey Tony! With all the recent changes and updates this got left behind 😓 sorry about that. There's not much work left to start running it and iterating on the tests IMO, so that's the good side. As mentioned in the P2, I think that this PR + simulator-llm-pilot is the way to go for E2E AI tests, but it would be nice to make it fully 🟢 to start with.
Pull request overview
Adds a Buildkite CI step to run AI-driven end-to-end UI tests on an iOS Simulator using the simulator-llm-pilot gem, including helper scripts for installing the gem, locating/booting a simulator, and building WebDriverAgent.
Changes:
- Adds a new Buildkite pipeline step (PR-only) to run AI E2E tests and upload Tests/AgentTests/results/** artifacts.
- Introduces CI scripts to install simulator-llm-pilot, find a booted simulator, build WebDriverAgent, install the app, and run the test suite.
- Updates AI test/navigation skill docs and adds app login instructions used by the test runner.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| Tests/AgentTests/app-instructions.md | Adds login-flow instructions for the agent-runner to avoid unsafe/manual credential entry. |
| Scripts/ci/install-simulator-llm-pilot.sh | Installs simulator-llm-pilot by building from a local checkout or cloning from GitHub. |
| Scripts/ci/find-booted-simulator.rb | Helper to return a booted simulator UDID (optionally waiting/polling). |
| .claude/skills/ios-sim-navigation/SKILL.md | Aligns documentation placeholder naming (<APP_BUNDLE_ID>). |
| .claude/skills/ai-test-runner/SKILL.md | Aligns documentation placeholder naming (<APP_BUNDLE_ID>). |
| .buildkite/pipeline.yml | Adds a new “AI E2E Tests” Buildkite step gated to PR builds. |
| .buildkite/commands/run-ai-e2e-tests.sh | Orchestrates artifact download, simulator/app setup, WDA build, and simulator-llm-pilot run. |
| .buildkite/commands/build-wda.sh | Clones/builds WebDriverAgent and skips rebuild when artifacts already exist. |
SIMULATOR_LLM_PILOT_REPO_URL="${SIMULATOR_LLM_PILOT_REPO_URL:-https://github.com/Automattic/simulator-llm-pilot.git}"
SIMULATOR_LLM_PILOT_SOURCE_PATH="${SIMULATOR_LLM_PILOT_SOURCE_PATH:-}"

build_dir="$(mktemp -d)"
trap 'rm -rf "$build_dir"' EXIT

source_path="${SIMULATOR_LLM_PILOT_SOURCE_PATH}"
if [[ -z "$source_path" && -f "${DEFAULT_LOCAL_GEM_PATH}/simulator-llm-pilot.gemspec" ]]; then
  source_path="${DEFAULT_LOCAL_GEM_PATH}"
fi

if [[ -n "$source_path" ]]; then
  echo "Using local simulator-llm-pilot source at ${source_path}"
  if [[ -d "${source_path}/.git" ]]; then
    source_revision="$(git -C "${source_path}" rev-parse HEAD)"
    git -C "${source_path}" archive HEAD | tar -x -C "$build_dir"
  else
    source_revision="local-filesystem"
    tar -cf - -C "${source_path}" . | tar -xf - -C "$build_dir"
  fi
else
  echo "Cloning simulator-llm-pilot from ${SIMULATOR_LLM_PILOT_REPO_URL}"
  git clone --depth 1 "${SIMULATOR_LLM_PILOT_REPO_URL}" "$build_dir"
  source_revision="$(git -C "$build_dir" rev-parse HEAD)"
fi
WEBDRIVERAGENT_REPO_URL="${WEBDRIVERAGENT_REPO_URL:-https://github.com/appium/WebDriverAgent.git}"
WEBDRIVERAGENT_REF="${WEBDRIVERAGENT_REF:-}"
ensure_wda_checkout

if [[ -d "$WDA_PROJECT" ]] && has_built_artifacts; then
  echo "WebDriverAgent already built, skipping."
  exit 0
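The body of `has_built_artifacts` is not part of this excerpt. One plausible check, sketched here in Ruby, is to look for an `.xctestrun` file, which `xcodebuild build-for-testing` writes under `Build/Products` of its derived data directory. The function name `built_artifacts?` and the choice of `.xctestrun` as the marker are assumptions, not the PR's implementation.

```ruby
# Assumption: "built artifacts" means at least one .xctestrun file under
# <derived_data>/Build/Products, which is where `xcodebuild build-for-testing`
# writes them. The PR's has_built_artifacts may check something else.
def built_artifacts?(derived_data_dir)
  pattern = File.join(derived_data_dir, 'Build', 'Products', '*.xctestrun')
  !Dir.glob(pattern).empty?
end
```

A marker-file check like this is cheap, but it cannot tell whether the artifacts match the currently checked-out WebDriverAgent revision, so a stale cache would still be reused.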
TIMESTAMP="$(date +%Y-%m-%d-%H%M)"
RESULTS_DIR="Tests/AgentTests/results/${TIMESTAMP}"
UDID="$(ruby Scripts/ci/find-booted-simulator.rb "$SIMULATOR_NAME" 2>/dev/null || true)"
if [[ -z "$UDID" ]]; then
  echo "No booted simulator named '$SIMULATOR_NAME' found. Booting..."
  xcrun simctl boot "$SIMULATOR_NAME" 2>/dev/null || true
  UDID="$(ruby Scripts/ci/find-booted-simulator.rb "$SIMULATOR_NAME" 30 1 2>/dev/null || true)"
output, status = Open3.capture2('xcrun', 'simctl', 'list', 'devices', 'booted', '-j')
exit 1 unless status.success?
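Only the first two lines of find-booted-simulator.rb appear here, so the following is a minimal sketch of how such a helper could look up a UDID and poll for it, not the script itself. The function names are hypothetical, and the excerpt does not say what the `30 1` arguments in the shell call mean; this sketch treats them as an attempt count and a sleep interval in seconds.

```ruby
require 'json'
require 'open3'

# Return the UDID of the first booted device whose name matches, or nil.
# `json` is the output of `xcrun simctl list devices booted -j`, whose top-level
# "devices" key maps runtime identifiers to arrays of device hashes.
def booted_udid(json, simulator_name)
  devices = JSON.parse(json).fetch('devices', {}).values.flatten
  match = devices.find { |d| d['state'] == 'Booted' && d['name'] == simulator_name }
  match && match['udid']
end

# Poll until a matching booted device appears or the attempts run out.
# `fetch` is injectable so the loop can be exercised without a simulator.
def wait_for_booted_udid(simulator_name, attempts: 30, interval: 1, fetch: nil)
  fetch ||= lambda do
    output, status = Open3.capture2('xcrun', 'simctl', 'list', 'devices', 'booted', '-j')
    status.success? ? output : nil
  end

  attempts.times do |i|
    json = fetch.call
    udid = json && booted_udid(json, simulator_name)
    return udid if udid

    # Skip the sleep after the final attempt.
    sleep(interval) if i < attempts - 1
  end
  nil
end
```

Matching on the exact device name means two booted simulators with the same name would resolve to whichever one `simctl` lists first; a caller that needs a specific runtime would have to filter on the runtime key as well.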





Summary
The gem handles everything internally: simulator detection, WDA lifecycle, agent loop with sandboxed tools, context window compression, verification/cleanup enforcement, and structured results.
Alternative approach: see #25443 for a Claude Code + wrapper scripts version of the same pipeline.
Ref: AINFRA-2176
Test plan
- Run .buildkite/commands/run-ai-e2e-tests.sh locally with a booted simulator and test site credentials
- […] (users-screen-loads.md) end-to-end
- Verify results.md is written with correct pass/fail status

🤖 Generated with Claude Code